Python for Bioinformatics

This Jupyter notebook is intended to be used alongside the book Python for Bioinformatics.

Note: The sample files used in this chapter must be accessible from this Jupyter notebook before they are opened. The following commands download them from GitHub and extract them into a directory called samples.

Chapter 13: Regular Expressions



In [ ]:

    
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/samples.tar.bz2 -o samples.tar.bz2
!mkdir -p samples
!tar xvfj samples.tar.bz2 -C samples



In [ ]:

    
import re



In [ ]:

    
mo = re.search('hello', 'Hello world, hello Python!')



In [ ]:

    
mo.group()









    Out[ ]:





'hello'



In [ ]:

    
mo.span()









    Out[ ]:





(13, 18)



In [ ]:

    
'Hello world, hello Python!'.index('hello')









    Out[ ]:





13
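
For a fixed substring, `str.index` and `re.search` report the same start position, but they fail differently when nothing is found. The following is a small sketch of that difference; the variable names are illustrative and not taken from the book.

```python
import re

text = 'Hello world, hello Python!'

# Both approaches locate the lowercase 'hello' at the same offset.
pos_index = text.index('hello')
pos_regex = re.search('hello', text).start()
print(pos_index, pos_regex)

# When the substring is absent, the failure modes differ:
# str.find returns -1, str.index raises ValueError, re.search returns None.
missing_find = text.find('bye')
missing_regex = re.search('bye', text)
try:
    text.index('bye')
    raised = False
except ValueError:
    raised = True
print(missing_find, missing_regex, raised)
```

Because `re.search` returns `None` on failure, calling `.group()` on its result without a check raises `AttributeError`.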



In [ ]:

    
import re



In [ ]:

    
mo = re.search('[Hh]ello', 'Hello world, hello Python!')



In [ ]:

    
mo.group()









    Out[ ]:





'Hello'



In [ ]:

    
re.findall("[Hh]ello","Hello world, hello Python,!")









    Out[ ]:





['Hello', 'hello']



In [ ]:

    
re.finditer("[Hh]ello", "Hello world, hello Python,!")









    Out[ ]:





<callable_iterator at 0x7f90f00dc3c8>



In [ ]:

    
mos = re.finditer("[Hh]ello", "Hello world, hello Python,!")



In [ ]:

    
for x in mos:
    print(x.group())
    print(x.span())









    



Hello
(0, 5)
hello
(13, 18)



In [ ]:

    
mo = re.match("hello", "Hello world, hello Python!")
print(mo)









    



None



In [ ]:

    
mo = re.match("Hello", "Hello world, hello Python!")
mo









    Out[ ]:





<_sre.SRE_Match object; span=(0, 5), match='Hello'>



In [ ]:

    
mo.group()









    Out[ ]:





'Hello'



In [ ]:

    
mo.span()









    Out[ ]:





(0, 5)
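
The cells above suggest that `re.match` only succeeds at the beginning of the string, while `re.search` scans the whole string. A short side-by-side sketch follows; `re.fullmatch` is not used in this notebook and is included only for comparison.

```python
import re

text = 'Hello world, hello Python!'

# match: the pattern must start at offset 0, so lowercase 'hello' fails.
at_start = re.match('hello', text)
# search: the pattern may start anywhere in the string.
anywhere = re.search('hello', text)
# fullmatch: the pattern must cover the entire string.
whole = re.fullmatch('[Hh]ello', 'hello')
partial = re.fullmatch('[Hh]ello', 'hello Python')

print(at_start, anywhere.span(), whole.group(), partial)
```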



In [ ]:

    
re.findall("[Hh]ello","Hello world, hello Python,!")









    Out[ ]:





['Hello', 'hello']



In [ ]:

    
rgx = re.compile("[Hh]ello")
rgx.findall("Hello world, hello Python,!")









    Out[ ]:





['Hello', 'hello']



In [ ]:

    
rgx = re.compile("[Hh]ello")
rgx.search("Hello world, hello Python,!")









    Out[ ]:





<_sre.SRE_Match object; span=(0, 5), match='Hello'>



In [ ]:

    
rgx.match("Hello world, hello Python,!")









    Out[ ]:





<_sre.SRE_Match object; span=(0, 5), match='Hello'>



In [ ]:

    
rgx.findall("Hello world, hello Python,!")









    Out[ ]:





['Hello', 'hello']
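
A compiled pattern can also carry flags. As a hedged comparison, the sketch below contrasts the character class `[Hh]` used above with the `re.IGNORECASE` flag; the longer test string is made up for this example.

```python
import re

# Hypothetical test string with a third, all-uppercase variant.
text = 'Hello world, hello Python, HELLO again!'

strict = re.compile('[Hh]ello')             # only the first letter may vary
loose = re.compile('hello', re.IGNORECASE)  # every letter may vary in case

strict_hits = strict.findall(text)
loose_hits = loose.findall(text)
print(strict_hits)
print(loose_hits)
```

The two are not interchangeable: the flag accepts any mix of upper and lower case, which may or may not be what a given search needs.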

Listing 13.1: findTAT.py: Find every “TAT” occurrence



In [ ]:

    
import re
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
rgx = re.compile("TAT")
# enumerate numbers the matches starting from 1.
for i, mo in enumerate(rgx.finditer(seq), start=1):
    print('Occurrence {0}: {1}'.format(i, mo.group()))
    print('Position: From {0} to {1}'.format(mo.start(),
                                            mo.end()))









    



Occurrence 1: TAT
Position: From 1 to 4
Occurrence 2: TAT
Position: From 18 to 21
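
Note that `finditer` reports non-overlapping matches only. If overlapping occurrences matter (they can in repeat-rich DNA), one common workaround is a zero-width lookahead. This is a sketch on a made-up sequence, not part of the book's listing.

```python
import re

# Hypothetical sequence in which 'TAT' overlaps with itself.
seq = "TATATAT"

# Plain pattern: after a match at 0-3 the scan resumes at offset 3.
plain = [mo.start() for mo in re.finditer("TAT", seq)]

# Lookahead: each match is empty, so the scan advances one base at a time
# and every start position of 'TAT' is reported.
overlapping = [mo.start() for mo in re.finditer("(?=TAT)", seq)]

print(plain)
print(overlapping)
```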



In [ ]:

    
import re
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
rgx = re.compile("(GC){3,}")
result = rgx.search(seq)
result.group()









    Out[ ]:





'GCGCGCGC'



In [ ]:

    
result.groups()









    Out[ ]:





('GC',)



In [ ]:

    
rgx = re.compile("((GC){3,})")
result = rgx.search(seq)
result.groups()









    Out[ ]:





('GCGCGCGC', 'GC')



In [ ]:

    
# Only the inner group is non-capturing
rgx = re.compile("((?:GC){3,})")
result = rgx.search(seq)
result.groups()









    Out[ ]:





('GCGCGCGC',)



In [ ]:

    
rgx = re.compile("TAT") # No group at all.
rgx.findall(seq) # This returns a list of matching strings.









    Out[ ]:





['TAT', 'TAT']



In [ ]:

    
rgx = re.compile("(GC){3,}") # One group. Return a list
rgx.findall(seq) # with the group for each match.









    Out[ ]:





['GC', 'GC']



In [ ]:

    
rgx = re.compile("((GC){3,})") # Two groups. Return a
rgx.findall(seq) # list with tuples for each match.









    Out[ ]:





[('GCGCGCGC', 'GC'), ('GCGCGC', 'GC')]



In [ ]:

    
rgx = re.compile("((?:GC){3,})") # Using a non-capturing
rgx.findall(seq) # group to get only the matches.









    Out[ ]:





['GCGCGCGC', 'GCGCGC']

Listing 13.2: subgroups.py: Find multiple sub-patterns



In [ ]:

    
import re
rgx = re.compile("(?P<TBX>TATA..).*(?P<CGislands>(?:GC){3,})")
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
result = rgx.search(seq)
print(result.group('CGislands'))
print(result.group('TBX'))









    



GCGCGC
TATAAG
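
The listing prints `GCGCGC`, the second GC run, rather than the longer first run `GCGCGCGC`. This happens because `.*` is greedy and consumes as much as it can before the last group is tried. The sketch below compares it with the non-greedy `.*?` and also shows `groupdict()`, which returns all named groups at once; the variable names are illustrative.

```python
import re

seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"

greedy = re.compile("(?P<TBX>TATA..).*(?P<CGislands>(?:GC){3,})")
lazy = re.compile("(?P<TBX>TATA..).*?(?P<CGislands>(?:GC){3,})")

# groupdict() maps each group name to the text it captured.
greedy_groups = greedy.search(seq).groupdict()
lazy_groups = lazy.search(seq).groupdict()

print(greedy_groups)  # CGislands is the last GC run in the sequence
print(lazy_groups)    # CGislands is the first GC run after the TATA box
```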

Listing 13.3: regexsys1.py: Count the lines that contain a user-supplied pattern



In [ ]:

    
import re, sys
myregex = re.compile(sys.argv[2])
counter = 0
with open(sys.argv[1]) as fh:
    for line in fh:
        if myregex.search(line):
            counter += 1
print(counter)









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-04bc1d111216> in <module>()
      2 myregex = re.compile(sys.argv[2])
      3 counter = 0
----> 4 with open(sys.argv[1]) as fh:
      5     for line in fh:
      6         if myregex.search(line):

FileNotFoundError: [Errno 2] No such file or directory: '-f'

Listing 13.4: countinfile.py: Count the occurrences of a pattern in a file



In [ ]:

    
import re, sys
myregex = re.compile(sys.argv[2])
i = 0
with open(sys.argv[1]) as fh:
    for line in fh:
        i += len(myregex.findall(line))
print(i)









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-36-40acf3b4aeac> in <module>()
      2 myregex = re.compile(sys.argv[2])
      3 i = 0
----> 4 with open(sys.argv[1]) as fh:
      5     for line in fh:
      6         i += len(myregex.findall(line))

FileNotFoundError: [Errno 2] No such file or directory: '-f'
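
Listings 13.3 and 13.4 are command-line scripts. Inside a notebook, `sys.argv` holds the kernel's own arguments, which is why `open()` is handed `'-f'` and fails. A notebook-friendly sketch wraps the same logic in functions that take the path and the pattern as parameters. The function names and the temporary demo file are assumptions made for this example.

```python
import os
import re
import tempfile

def count_matching_lines(path, pattern):
    """Count lines containing at least one match (as in Listing 13.3)."""
    rgx = re.compile(pattern)
    with open(path) as fh:
        return sum(1 for line in fh if rgx.search(line))

def count_occurrences(path, pattern):
    """Count every non-overlapping match in the file (as in Listing 13.4)."""
    rgx = re.compile(pattern)
    with open(path) as fh:
        return sum(len(rgx.findall(line)) for line in fh)

# Write a small throwaway file so the example does not depend on samples/.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('TATTAT\nGCGC\nATATA\n')
    demo_path = tmp.name

lines_with_tat = count_matching_lines(demo_path, 'TAT')
total_tat = count_occurrences(demo_path, 'TAT')
os.remove(demo_path)
print(lines_with_tat, total_tat)
```

The two counts differ whenever a line holds more than one match: here the first line contributes one matching line but two occurrences.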

Listing 13.5: deletegc.py: Delete GC repeats (3 or more GC in a row)



In [ ]:

    
import re
regex = re.compile("(?:GC){3,}")
seq="ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"
print("Before:", seq)
print("After:", regex.sub("", seq))









    



Before: ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG
After: ATGATCGTACTTTCATGTGATAGACTATAAG
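
Two related variations may be useful here. `subn` returns the number of replacements along with the new string, and `sub` accepts a function as the replacement, which allows masking the repeats instead of deleting them. This is a sketch only; masking with `N` is an assumption for illustration and not something the book's listing does.

```python
import re

regex = re.compile("(?:GC){3,}")
seq = "ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"

# subn returns a tuple: (new string, number of substitutions made).
cleaned, n_removed = regex.subn("", seq)

# A callable replacement receives each match object; here every GC run
# is replaced by the same number of 'N' so that coordinates are preserved.
masked = regex.sub(lambda mo: "N" * len(mo.group()), seq)

print(cleaned, n_removed)
print(masked)
```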

Listing 13.6: searchinfasta.py: Search for a pattern in a FASTA file



In [ ]:

    
import re
pattern = "[LIVM]{2}.RL[DE].{4}RLE"
with open('samples/Q5R5X8.fas') as fh:
    fh.readline() # Discard the first line.
    seq = ""
    for line in fh:
        seq += line.strip()
rgx = re.compile(pattern)
result = rgx.search(seq)
patternfound = result.group()
span = result.span()
leftpos = span[0]-10
if leftpos<0:
    leftpos = 0
print(seq[leftpos:span[0]].lower() + patternfound +
      seq[span[1]:span[1]+10].lower())









    



manmqgLVERLERAVSRLEslsaeshrpp
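
The listing assumes the pattern is present; if `search` returned `None`, the call to `result.group()` would raise `AttributeError`. A guarded variant could look like the sketch below. The function name `show_in_context` and the `flank` parameter are hypothetical, and the test fragment is just the stretch of sequence visible in the output above, not the full protein.

```python
import re

def show_in_context(seq, pattern, flank=10):
    """Return the match in uppercase with lowercase flanks, or None."""
    mo = re.search(pattern, seq)
    if mo is None:
        return None
    start, end = mo.span()
    # max() keeps the left slice from wrapping around with a negative index.
    left = seq[max(0, start - flank):start].lower()
    right = seq[end:end + flank].lower()
    return left + mo.group() + right

pattern = "[LIVM]{2}.RL[DE].{4}RLE"
fragment = "MANMQGLVERLERAVSRLESLSAESHRPP"

found = show_in_context(fragment, pattern)
not_found = show_in_context("AAAA", pattern)
print(found)
print(not_found)
```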

Listing 13.7: cleanseq.py: Clean a DNA sequence



In [ ]:

    
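import re
# Raw string so that \d reaches the regex engine as the digit class.
regex = re.compile(r' |\d|\n|\t')
seq = ''
# The with block closes the file once all lines are read.
with open('samples/pMOSBlue.txt') as fh:
    for line in fh:
        seq += regex.sub('', line)
print(seq)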









    



ATGACCATGATTACGCCAAGCTCTAATACGACTCACTATAGGGAAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCTACTAGTCATATGGATATCGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTT
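
The same cleaning step can be written with character classes. The sketch below works on an in-memory string so it runs without the sample file; the numbered, space-separated layout of `raw` is an assumption about what such input typically looks like. It compares a blacklist (remove whitespace and digits) with a whitelist (keep only the four bases).

```python
import re

# Hypothetical numbered sequence text, similar in spirit to the sample file.
raw = "        1 ATGACCATGA TTACGCCAAG\n       21 CTCTAATACG\n"

# Blacklist: remove any whitespace character or digit.
by_blacklist = re.sub(r'[\s\d]', '', raw)

# Whitelist: remove everything that is not one of the four bases.
by_whitelist = re.sub(r'[^ACGT]', '', raw.upper())

print(by_blacklist)
print(by_whitelist)
```

The whitelist is stricter: it would also drop ambiguity codes such as `N`, so which form is appropriate depends on the data.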